1 Introduction

Despite its widespread acceptance, there are some problems in using proba- 
bility to represent uncertainty. Perhaps the most serious is that probability 
is not good at representing ignorance. The following two examples illustrate 
the problem. 

*The material in this chapter is taken, often verbatim, from [Halpern 2003], which the
reader is encouraged to consult for further details and references.

†Supported in part by NSF under grants CTC-0208535, ITR-0325453, and IIS-0534064,
by ONR under grants N00014-00-1-03-41 and N00014-01-10-511, by the DoD Multidisciplinary
University Research Initiative (MURI) program administered by the ONR under
grants N00014-01-1-0795 and N00014-04-1-0725, and by AFOSR under grant F49620-02-1-0101.

Example 1.1: Suppose that a coin is tossed once. There are two possible
worlds, h and t, corresponding to the two possible outcomes. If the coin is
known to be fair, it seems reasonable to assign probability 1/2 to each of
these worlds. However, suppose that the coin has an unknown bias (where
the bias of a coin is the probability that it lands heads). How should this be
represented? One approach might be to continue to take heads and tails as
the elementary outcomes and, applying the principle of indifference, assign
them both probability 1/2, just as in the case of a fair coin. However, there
seems to be a significant qualitative difference between a fair coin and a coin
of unknown bias. This difference has some pragmatic consequences. For
example, as Kyburg (e.g., in [Kyburg 1961]) has pointed out, the assumption
that heads and tails have probability 1/2, together with the assumption that
consecutive coin tosses are independent, implies that, if the coin is tossed
1,000,000 times, then the probability that the coin will land heads somewhere
between 498,000 and 502,000 times is greater than .999. This certainly
does not seem like something that an agent who has no idea of the bias of the
coin should know!
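The size of the probability in Kyburg's observation can be checked numerically. The following is a minimal sketch, assuming the normal approximation to the binomial distribution (with a continuity correction) rather than an exact computation; the function name `prob_heads_between` is illustrative only.

```python
import math

def prob_heads_between(n, lo, hi, p=0.5):
    """Approximate P(lo <= number of heads <= hi) for n independent tosses
    with heads-probability p, using the normal approximation to the
    binomial with a continuity correction (an approximation, not exact)."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))

    def cdf(x):
        # Normal CDF with the binomial's mean and standard deviation.
        return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

    return cdf(hi + 0.5) - cdf(lo - 0.5)

fair = prob_heads_between(1_000_000, 498_000, 502_000)
biased = prob_heads_between(1_000_000, 498_000, 502_000, p=0.6)
print(fair > 0.999)    # the fair-coin assumption puts the event above .999
print(biased < 0.001)  # with bias .6 the same event is extremely unlikely
```

Under these assumptions the interval is about four standard deviations wide on each side of the mean, which is why the fair-coin assumption commits the agent to such a strong prediction.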

Example 1.2: Suppose that a bag contains 100 marbles; 30 are known to 
be red, and the remainder are known to be either blue or yellow, although 
the exact proportion of blue and yellow is not known. What is the likelihood 
that a marble taken out of the bag is yellow? This can be modeled with three 
possible worlds, red, blue, and yellow, one for each of the possible outcomes. 
It seems reasonable to assign probability .3 to the outcome of choosing a red
marble, and thus probability .7 to choosing either blue or yellow, but what 
probability should be assigned to the other two outcomes? 

Empirically, it is clear that people do not use probability to represent the 
uncertainty in this example. For example, consider the following three bets. 
In each case a marble is chosen from the bag. 

• \(B_r\) pays $1 if the marble is red, and 0 otherwise;

• \(B_b\) pays $1 if the marble is blue, and 0 otherwise;

• \(B_y\) pays $1 if the marble is yellow, and 0 otherwise.

People invariably prefer \(B_r\) to both \(B_b\) and \(B_y\), and they are indifferent
between \(B_b\) and \(B_y\). The fact that they are indifferent between \(B_b\) and \(B_y\)
suggests that they view it equally likely that the marble chosen is blue and
that it is yellow. This seems reasonable; the problem statement provides no
reason to prefer blue to yellow, or vice versa. However, if the probability of
drawing a red marble is taken to be .3, then the probability of drawing a blue
marble and that of drawing a yellow marble are both .35, which suggests that
\(B_y\) and \(B_b\) should both be preferred to \(B_r\).

Moreover, now consider the following two bets:

• \(B_{ry}\) pays $1 if the marble is red or yellow, and 0 otherwise;

• \(B_{by}\) pays $1 if the marble is blue or yellow, and 0 otherwise.

While most people prefer \(B_r\) to \(B_b\), most also prefer \(B_{by}\) to \(B_{ry}\). There is no
probability measure on {b, r, y} that would both make r more likely than b
and make {r, y} less likely than {b, y}. (This is essentially Ellsberg's [1961]
paradox; I return to this issue in Section 7.)

One natural way of representing uncertainty in both of these cases is 
by using a set of probability measures, rather than a single measure. For 
example, the uncertainty in Example 1.1 can be represented by the set
\(\mathcal{P}_1 = \{\mu_\alpha : \alpha \in [0,1]\}\) of probability measures on {h, t}, where \(\mu_\alpha\) gives
h probability \(\alpha\). In Example 1.2, the uncertainty can be represented using
the set \(\mathcal{P}_2 = \{\mu'_\alpha : \alpha \in [0, .7]\}\) of probability measures on {red, blue, yellow},
where \(\mu'_\alpha\) gives red probability .3, blue probability \(\alpha\), and yellow probability
\(.7 - \alpha\).

In the rest of this paper, I explore the use of sets of probability measures 
as a representation of uncertainty. 

2 Lower and Upper Probability and Dutch 
Book Arguments 

Let \(\mathcal{P}\) be a set of probability measures all defined on all subsets of a finite
set W of possible worlds.¹ Given a set X of real numbers, let sup X, the
supremum (or just sup) of X, be the least upper bound of X: the smallest
real number that is at least as large as all the elements in X. That is,
\(\sup X = \alpha\) if \(x \le \alpha\) for all \(x \in X\) and if, for all \(\alpha' < \alpha\), there is some \(x \in X\)
such that \(x > \alpha'\). For example, if X = {1/2, 3/4, 7/8, 15/16, ...}, then
sup X = 1. Similarly, inf X, the infimum (or just inf) of X, is the greatest
lower bound of X: the largest real number that is less than or equal to every
element in X. For \(U \subseteq W\), define

\[
\begin{aligned}
\mathcal{P}(U) &= \{\mu(U) : \mu \in \mathcal{P}\},\\
\mathcal{P}_*(U) &= \inf \mathcal{P}(U), \text{ and}\\
\mathcal{P}^*(U) &= \sup \mathcal{P}(U).
\end{aligned}
\]

¹ The assumptions that W is finite and that every subset of W is measurable, that is,
in the domain of every probability measure \(\mu \in \mathcal{P}\), are made for ease of exposition only.
They can both easily be dropped.
\(\mathcal{P}_*(U)\) is called the lower probability of U, and \(\mathcal{P}^*(U)\) is called the upper
probability of U. If \(\mathcal{P}_*(U) = \mathcal{P}^*(U)\) for all subsets U of W, then it is easy
to see that \(\mathcal{P}\) must be a singleton \(\{\mu\}\), and \(\mathcal{P}_* = \mathcal{P}^* = \mu\). In general, of
course, \(\mathcal{P}_* \ne \mathcal{P}^*\). For a set U, the difference \(\mathcal{P}^*(U) - \mathcal{P}_*(U)\) can be viewed as
characterizing our ignorance about U. In Example 1.2, there is uncertainty
about the likelihood of red being chosen, but there is no ignorance: the
likelihood is exactly .3. This is captured by \(\mathcal{P}_2\): \((\mathcal{P}_2)_*(red) = (\mathcal{P}_2)^*(red) = .3\).
On the other hand, there is ignorance about the likelihood of blue and
yellow being chosen. And, indeed, \((\mathcal{P}_2)_*(blue) = 0\) and \((\mathcal{P}_2)^*(blue) = .7\), and
similarly for yellow.
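The lower and upper probabilities for the marble example can be computed directly. The sketch below assumes a finite grid of measures (blue gets probability a/100 for a = 0, ..., 70) as a stand-in for the continuum of measures in the example; because the grid contains the endpoints, min and max agree with inf and sup here. The names `lower` and `upper` are illustrative.

```python
from fractions import Fraction

# Finite stand-in for the set of measures in Example 1.2 (an assumption):
# red always has probability 3/10; blue ranges over 0, 1/100, ..., 70/100.
measures = [
    {"red": Fraction(30, 100), "blue": Fraction(a, 100), "yellow": Fraction(70 - a, 100)}
    for a in range(71)
]

def lower(measures, event):
    """Lower probability: the smallest probability any measure gives the event."""
    return min(sum(mu[w] for w in event) for mu in measures)

def upper(measures, event):
    """Upper probability: the largest probability any measure gives the event."""
    return max(sum(mu[w] for w in event) for mu in measures)

for event in ({"red"}, {"blue"}, {"yellow"}, {"blue", "yellow"}):
    print(sorted(event), lower(measures, event), upper(measures, event))
```

The gap between the two values is zero for red and for {blue, yellow}, and .7 for blue and for yellow, matching the reading of the gap as a measure of ignorance.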

While lower and upper probabilities seem natural, how reasonable is it to 
use them to represent uncertainty? I investigate this question in a number of 
different contexts in the next few sections. For now, I briefly consider one of 
the most prominent justifications for probability, the Dutch book argument, 
which goes back to Ramsey [1931] and de Finetti [1931, 1937], and see how 
it fares in the context of sets of probabilities. 

Roughly speaking, the Dutch book argument says that if odds do not act 
like probabilities, then there is a collection of bets that guarantees a sure 
loss. Somewhat more precisely, suppose that an agent must post odds for 
each subset of a set W. If the agent chooses odds of, say, 4:5 on \(U \subseteq W\),
then this is supposed to mean that the agent is willing to accept a bet of any
size for or against U. If a bookie bets $k on U, then if U happens (i.e., if
the actual world is in U; it is assumed that this can always be determined),
then the bookie wins $(9/4)k; if not, the bookie loses the $k. Similarly, if
the bookie bets $k against U, then if U happens, the bookie loses the
$k, and if not, then the bookie wins $(9/5)k. In general, if the odds for U
are \(o_1 : o_2\), then if a bookie bets $k on U, then he wins \(k(o_1 + o_2)/o_1\) dollars if U
happens and loses the $k otherwise, and if he bets against U, he loses $k if
U happens and wins \(k(o_1 + o_2)/o_2\) dollars otherwise. If the odds on U are \(o_1 : o_2\), let
\(p_U\) be \(o_1/(o_1 + o_2)\). The key claim is that, unless the numbers \(p_U\) act like
probabilities (and, in particular, \(p_W = 1\) and \(p_{U \cup V} = p_U + p_V\) if U and V are
disjoint), then the agent is irrational: there is a Dutch book, a collection of
bets which guarantee a loss for the agent. Conversely, if the \(p_U\)'s do act like
probabilities, then there is no Dutch book.
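To see how a sure loss can arise when the posted numbers are not additive, consider the following minimal sketch. It assumes a simplified, hypothetical betting convention rather than the odds formulation above: a ticket on an event pays 1 if the event occurs, and the agent is willing to buy or sell any ticket at its posted price. The function `agent_net` and the three-world space are illustrative.

```python
from fractions import Fraction

def agent_net(p_u, p_v, p_uv, world, U, V):
    """Agent's net gain in `world` when a bookie exploits non-additive prices.

    Assumed convention (hypothetical): a ticket on an event pays 1 if the
    event occurs, and the agent buys or sells any ticket at its posted
    price. U and V must be disjoint; p_uv is the price posted for U | V."""
    def ind(A):
        return 1 if world in A else 0

    # If p_uv < p_u + p_v, the bookie sells the agent tickets on U and V and
    # buys a ticket on U | V from the agent; otherwise the trades are reversed.
    sign = 1 if p_uv < p_u + p_v else -1
    return sign * ((ind(U) - p_u) + (ind(V) - p_v) - (ind(U | V) - p_uv))

W = {"w1", "w2", "w3"}
U, V = {"w1"}, {"w2"}
prices = (Fraction(3, 10), Fraction(3, 10), Fraction(1, 2))  # 3/10 + 3/10 != 1/2
losses = {w: agent_net(*prices, w, U, V) for w in sorted(W)}
print(losses)
```

Because U and V are disjoint, the payoffs of the three tickets cancel in every world, so the agent's net gain is just the price discrepancy, which is negative whichever world is actual. With additive prices the same trades net to zero.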

Does this mean it is irrational to use other representations of uncertainty, 
such as sets of probability measures? Many problems have been noted with 
Dutch book arguments (see, for example, [Howson and Urbach 1989, pp. 89- 
91], [Hájek 2007]). Of most relevance here is the implicit assumption that
an agent can or is willing to post fair odds, that is, odds for which he is
indifferent between a bet for and against a subset U of W. In the stock
market, bid and ask prices are not necessarily equal. Suppose that instead of
posting fair odds, the agent were only willing to post the analogue of bid and
ask prices: odds at which he is willing to take a bet on U and (lower) odds at
which he is willing to take a bet against U. In that case, arguments similar
in spirit to those used by de Finetti and Ramsey can be used to show that 
the agent is rational iff his odds determine lower and upper probabilities (see 
[Smith 1961; Williams 1976]). The key to making these arguments precise is 
a characterization of lower and upper probabilities, which is the subject of 
the next section. 

3 Characterizing Lower and Upper Probability

A probability measure on W is a function \(\mu : 2^W \to [0, 1]\) characterized by
two well-known properties:

P1. \(\mu(W) = 1\).

P2. \(\mu(U_1 \cup U_2) = \mu(U_1) + \mu(U_2)\) if \(U_1\) and \(U_2\) are disjoint subsets of W.

Every probability measure satisfies P1 and P2, and every function
\(\mu : 2^W \to [0, 1]\) satisfying P1 and P2 is a probability measure. Property P2 is
known as (finite) additivity; note that the fact that \(\mu(\emptyset) = 0\) follows easily
from P2; P1 and P2 together imply that \(\mu(\overline{U}) = 1 - \mu(U)\).

Are there similar properties characterizing lower and upper probabilities? 
It is easy to see that P1 continues to hold for both lower and upper
probabilities. P2 does not hold, but lower probability is superadditive and upper
probability is subadditive, so that for disjoint sets U and V,

\[
\begin{aligned}
\mathcal{P}_*(U \cup V) &\ge \mathcal{P}_*(U) + \mathcal{P}_*(V), \text{ and}\\
\mathcal{P}^*(U \cup V) &\le \mathcal{P}^*(U) + \mathcal{P}^*(V).
\end{aligned}
\tag{1}
\]

In addition, the relationship between lower and upper probability is given
by

\[ \mathcal{P}_*(U) = 1 - \mathcal{P}^*(\overline{U}). \tag{2} \]

(I leave the straightforward proof of these results to the reader.)
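Properties (1) and (2) can be checked by brute force on a small example. The sketch below assumes a hypothetical three-world space and a small finite set of measures chosen only for illustration; it is a sanity check on one instance, not a proof.

```python
from fractions import Fraction
from itertools import combinations

W = ("r", "b", "y")
# Hypothetical finite set of measures: r has probability 3/10, b ranges over 0..7/10.
measures = [
    {"r": Fraction(3, 10), "b": Fraction(a, 10), "y": Fraction(7 - a, 10)}
    for a in range(8)
]

def prob(mu, event):
    return sum(mu[w] for w in event)

def lower(event):
    return min(prob(mu, event) for mu in measures)

def upper(event):
    return max(prob(mu, event) for mu in measures)

# All subsets of W, as frozensets so that | and & can be used.
events = [frozenset(c) for k in range(len(W) + 1) for c in combinations(W, k)]
disjoint_pairs = [(U, V) for U in events for V in events if not U & V]

superadditive = all(lower(U | V) >= lower(U) + lower(V) for U, V in disjoint_pairs)
subadditive = all(upper(U | V) <= upper(U) + upper(V) for U, V in disjoint_pairs)
dual = all(lower(U) == 1 - upper(frozenset(W) - U) for U in events)
print(superadditive, subadditive, dual)
```

On this instance superadditivity is strict for U = {b} and V = {y}: each has lower probability 0, while their union has lower probability 7/10.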






While (1) and (2) hold for all lower and upper probabilities, these proper- 
ties do not completely characterize them. For example, the following property 
holds for lower and upper probabilities if U and V are disjoint: 

\[ \mathcal{P}_*(U \cup V) \le \mathcal{P}_*(U) + \mathcal{P}^*(V) \le \mathcal{P}^*(U \cup V); \tag{3} \]

moreover, (3) does not follow from (1) and (2) [Halpern and Pucella 2002a].
However, even adding (3) to (1) and (2) does not provide a complete
characterization of lower and upper probabilities. The property needed to get
a complete characterization is somewhat complex. To state it precisely, say
that a set \(\mathcal{U}\) of subsets of W covers a subset U of W exactly k times if every
element of U is in exactly k sets in \(\mathcal{U}\). Consider the following property:

If \(\mathcal{U} = \{U_1, \ldots, U_k\}\) covers U exactly m + n times and covers \(\overline{U}\)
exactly m times, then \(\sum_{i=1}^{k} \mathcal{P}_*(U_i) \le m + n\,\mathcal{P}_*(U)\). (4)

(There is of course an analogous property for upper probability, with \(\le\)
replaced by \(\ge\).) It is not hard to show that lower probabilities satisfy (4) and
that (1) and (3) follow from (4) and (2). Indeed, in a precise sense, as Anger
and Lembcke [1985] show, (4) completely characterizes lower probabilities
(and hence, together with (2), upper probabilities as well).

Theorem 3.1: [Anger and Lembcke 1985] Lower probability satisfies (4).
Conversely, if \(f : 2^W \to [0,1]\) satisfies (4) (with \(\mathcal{P}_*\) replaced by f) and
f(W) = 1, then there exists a set \(\mathcal{P}\) of probability measures such that \(f = \mathcal{P}_*\).²

Although I have been focusing on lower and upper probability, it is impor- 
tant to stress that sets of probability measures contain more information than 
is captured by their lower and upper probability, as the following example 
shows. 

Example 3.2: Consider two variants of Example 1.2. In the first, all that
is known is that there are at most 50 yellow marbles and at most 50 blue
marbles in a bag of 100 marbles; no information at all is given about the
number of red marbles. In the second case, it is known that there are exactly
as many blue marbles as yellow marbles. The first situation can be captured
by the set \(\mathcal{P}_3 = \{\mu : \mu(blue) \le .5,\ \mu(yellow) \le .5\}\). The second situation can
be captured by the set \(\mathcal{P}_4 = \{\mu : \mu(b) = \mu(y)\}\). These sets of measures are
obviously quite different; in fact \(\mathcal{P}_4\) is a strict subset of \(\mathcal{P}_3\). However, it is
easy to see that \((\mathcal{P}_3)_* = (\mathcal{P}_4)_*\) and, hence, that \((\mathcal{P}_3)^* = (\mathcal{P}_4)^*\). Thus, the fact
that blue and yellow have equal probability in every measure in \(\mathcal{P}_4\) has been
lost by considering only lower and upper probability. I return to this issue
in Section 6.

² Besides the characterization of Anger and Lembcke given in Theorem 3.1, a number
of other characterizations of lower and upper probability have been given in the literature,
all similar in spirit [Giles 1982; Huber 1976; Huber 1981; Lorentz 1952; Williams 1976;
Wolf 1977].

4 Dempster-Shafer Belief Functions as Lower 
Probabilities 

The Dempster-Shafer theory of evidence, originally introduced by Arthur 
Dempster [1967, 1968] and then developed by Glenn Shafer [1976], provides 
another approach to attaching likelihoods to events. This approach starts 
out with a belief function (sometimes called a support function). Given a 
set W of possible worlds and \(U \subseteq W\), the belief in U, denoted Bel(U), is a
number in the interval [0, 1]. A belief function Bel defined on a space W
must satisfy the following three properties:

B1. \(\mathrm{Bel}(\emptyset) = 0\).

B2. \(\mathrm{Bel}(W) = 1\).

B3. \(\mathrm{Bel}(\bigcup_{i=1}^{n} U_i) \ge \sum_{i=1}^{n} \sum_{\{I \subseteq \{1,\ldots,n\} : |I| = i\}} (-1)^{i+1}\, \mathrm{Bel}(\bigcap_{j \in I} U_j)\), for n = 1, 2, 3, ....

B1 and B2 just say that, like probability measures, belief functions follow
the convention of using 0 and 1 to denote the minimum and maximum
likelihood. B3 is closely related to the inclusion-exclusion rule for probability.
The inclusion-exclusion rule is used to compute the probability of the union
of (not necessarily disjoint) sets. In the case of two sets U and V, the rule
says

\[ \mu(U \cup V) = \mu(U) + \mu(V) - \mu(U \cap V). \]

In the case of three sets \(U_1\), \(U_2\), \(U_3\), similar arguments show that

\[
\mu(U_1 \cup U_2 \cup U_3) = \mu(U_1) + \mu(U_2) + \mu(U_3) - \mu(U_1 \cap U_2) - \mu(U_1 \cap U_3) - \mu(U_2 \cap U_3) + \mu(U_1 \cap U_2 \cap U_3).
\]

That is, the probability of the union of \(U_1\), \(U_2\), and \(U_3\) can be determined
by adding the probability of the individual sets (these are one-way
intersections), subtracting the probability of the two-way intersections, and adding
the probability of the three-way intersections. The generalization of this rule
to k sets, with = replaced by \(\ge\), is just B3. It follows that every probability
measure is a belief function.

If U and V are disjoint sets, then it easily follows from B1 and B3 that
\(\mathrm{Bel}(U \cup V) \ge \mathrm{Bel}(U) + \mathrm{Bel}(V)\). That is, Bel is superadditive, just like a
lower probability. And just like a lower probability, Bel(U) can be viewed
as providing a lower bound on the likelihood of U. Define
\(\mathrm{Plaus}(U) = 1 - \mathrm{Bel}(\overline{U})\). Plaus is a plausibility function; Plaus(U) is the plausibility of U.
A plausibility function bears the same relationship to a belief function that
upper probability bears to lower probability.

By B2 and B3, for all subsets \(U \subseteq W\), \(1 = \mathrm{Bel}(W) \ge \mathrm{Bel}(U) + \mathrm{Bel}(\overline{U})\),
so

\[ \mathrm{Plaus}(U) = 1 - \mathrm{Bel}(\overline{U}) \ge \mathrm{Bel}(U). \]

Thus, for an event U, the interval [Bel(U), Plaus(U)] can be viewed as describing
the range of possible values of the likelihood of U, just like \([\mathcal{P}_*(U), \mathcal{P}^*(U)]\).

There is in fact a deeper connection between belief functions and lower 
probabilities: every belief function is a lower probability and the corresponding
plausibility function is the corresponding upper probability.

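Theorem 4.1: Given a belief function Bel defined on a space W, let
\(\mathcal{P}_{\mathrm{Bel}} = \{\mu : \mu(U) \ge \mathrm{Bel}(U) \text{ for all } U \subseteq W\}\). Then \(\mathrm{Bel} = (\mathcal{P}_{\mathrm{Bel}})_*\) and
\(\mathrm{Plaus} = (\mathcal{P}_{\mathrm{Bel}})^*\).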

The converse of Theorem 4.1 does not hold, as the following example
shows.

Example 4.2: Suppose that W = {a, b, c, d}, \(\mathcal{P} = \{\mu_1, \mu_2\}\),
\(\mu_1(a) = \mu_1(b) = \mu_1(c) = \mu_1(d) = 1/4\), and \(\mu_2(a) = \mu_2(c) = 1/2\) (so that \(\mu_2(b) = \mu_2(d) = 0\)).
Let \(U_1 = \{a, b\}\) and \(U_2 = \{b, c\}\). It is easy to check that \(\mathcal{P}_*(U_1) = \mathcal{P}_*(U_2) = 1/2\),
\(\mathcal{P}_*(U_1 \cup U_2) = 3/4\), and \(\mathcal{P}_*(U_1 \cap U_2) = 0\). \(\mathcal{P}_*\) thus cannot be a belief
function, because it violates B3:

\[ \mathcal{P}_*(U_1 \cup U_2) < \mathcal{P}_*(U_1) + \mathcal{P}_*(U_2) - \mathcal{P}_*(U_1 \cap U_2). \]
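The arithmetic in Example 4.2 is easy to verify mechanically. The following sketch uses exact rational arithmetic; the helper `lower` is an illustrative name, and the two measures are exactly those given in the example.

```python
from fractions import Fraction as F

# The two measures of Example 4.2 on W = {a, b, c, d}.
mu1 = {"a": F(1, 4), "b": F(1, 4), "c": F(1, 4), "d": F(1, 4)}
mu2 = {"a": F(1, 2), "b": F(0), "c": F(1, 2), "d": F(0)}

def lower(event):
    """Lower probability of `event` with respect to {mu1, mu2}."""
    return min(sum(mu[w] for w in event) for mu in (mu1, mu2))

U1, U2 = {"a", "b"}, {"b", "c"}
lhs = lower(U1 | U2)                              # lower probability of the union
rhs = lower(U1) + lower(U2) - lower(U1 & U2)      # two-set inclusion-exclusion bound
print(lhs, rhs, lhs < rhs)
```

B3 for n = 2 would require the first quantity to be at least as large as the second, so a strict inequality in the other direction confirms the violation.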
Thus, lower probabilities are a strictly more expressive representation of
uncertainty than belief functions. 

I remark that while belief functions can be understood (to some extent) 
in terms of lower probability, this is not the only way of understanding them. 
Shafer, for example, views belief functions as a way of representing evidence; 
see [Halpern and Fagin 1992] for a discussion of these two ways of under- 
standing belief functions. 

5 Updating Sets of Probabilities 

Suppose that an agent's uncertainty is defined in terms of a set \(\mathcal{P}\) of
probability measures. How should the agent update his beliefs in light of observing
an event U? The obvious thing to do is to condition each member of \(\mathcal{P}\) on
U. This suggests that after observing U, the agent's uncertainty should be
represented by the set \(\{\mu|U : \mu \in \mathcal{P}\}\) (where \(\mu|U\) is the conditional
probability measure that results by conditioning \(\mu\) on U). There is one obvious issue
that needs to be addressed: What happens if \(\mu(U) = 0\) for some \(\mu \in \mathcal{P}\)?
There are two choices here: either to say that conditioning makes sense only
if \(\mu(U) > 0\) for all \(\mu \in \mathcal{P}\) (i.e., if \(\mathcal{P}_*(U) > 0\)) or to consider only those
measures \(\mu\) for which \(\mu(U) > 0\). The latter choice is somewhat more general, so
that is what I use here. Thus, I define

\[ \mathcal{P}|U = \{\mu|U : \mu \in \mathcal{P},\ \mu(U) > 0\}. \]

Once the agent has a set \(\mathcal{P}|U\) of conditional probability measures, it is
possible to consider lower and upper conditional probabilities. However, note
that the lower and upper conditional probabilities are not determined by the
lower and upper probabilities, as the following example shows.

Example 5.1: Let \(\mathcal{P}_3\) and \(\mathcal{P}_4\) be the sets of probability measures constructed
in Example 3.2. As was already observed, \((\mathcal{P}_3)_* = (\mathcal{P}_4)_*\) (and so
\((\mathcal{P}_3)^* = (\mathcal{P}_4)^*\)). But \((\mathcal{P}_3)_*(b \mid \{b,y\}) = 0\), while \((\mathcal{P}_4)_*(b \mid \{b,y\}) = 1/2\). Thus,
even though the upper and lower probability determined by \(\mathcal{P}_3\) and \(\mathcal{P}_4\) are
the same, the upper and lower probabilities determined by \(\mathcal{P}_3|\{b,y\}\) and
\(\mathcal{P}_4|\{b,y\}\) are not.
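Both halves of this observation can be checked on a discretized version of the two sets. The sketch below assumes marble counts in whole hundredths (a finite stand-in for the sets of Example 3.2) and conditions only on measures that give the observed event positive probability, as in the definition above. All function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

def measure(b, y):
    """Measure for a bag of 100 marbles with b blue, y yellow, and the rest red."""
    return {"r": Fraction(100 - b - y, 100), "b": Fraction(b, 100), "y": Fraction(y, 100)}

P3 = [measure(b, y) for b in range(51) for y in range(51)]  # at most 50 blue, 50 yellow
P4 = [measure(t, t) for t in range(51)]                      # as many blue as yellow

def prob(mu, event):
    return sum(mu[w] for w in event)

def lower(ms, event):
    return min(prob(mu, event) for mu in ms)

def lower_cond(ms, event, given):
    """Lower conditional probability, using only measures with mu(given) > 0."""
    return min(prob(mu, event & given) / prob(mu, given)
               for mu in ms if prob(mu, given) > 0)

events = [frozenset(c) for k in range(4) for c in combinations("rby", k)]
same_lower = all(lower(P3, U) == lower(P4, U) for U in events)
BY = frozenset("by")
print(same_lower, lower_cond(P3, frozenset("b"), BY), lower_cond(P4, frozenset("b"), BY))
```

The unconditional lower probabilities coincide on every event, yet after conditioning on {b, y} the two sets give different lower probabilities for b.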

The following example gives a sense of how conditioning works with sets 
of probabilities. 
Example 5.2: The three-prisoners puzzle is the following old puzzle, which is
discussed, for example, by Mosteller [1965] and Gardner [1961]:

One of three prisoners, a, b, and c, has been chosen by a fair 
lottery to be pardoned, while the other two will be executed. 
Prisoner a does not know who has been pardoned; the jailer does. 
Thus, a says to the jailer, "Since either b or c is certainly going 
to be executed, you will give me no information about my own 
chances if you give me the name of one man, either b or c, who 
is going to be executed." Accepting this argument, the jailer 
truthfully replies, "b will be executed." Thereupon a feels happier
because before the jailer replied, his own chance of execution was 
2/3, but afterward there are only two people, himself and c, who 
could be the one not executed, and so his chance of execution is 
1/2. 

It seems that the jailer did not give a any new relevant information. Is a 
justified in believing that his chances of avoiding execution have improved? 
If so, it seems that a would be equally justified in believing that his chances 
of avoiding execution would have improved if the jailer had said "c will be 
executed." Thus, it seems that a's prospects improve no matter what the 
jailer says! That does not seem quite right. 

Conditioning is implicitly being applied here to a space consisting of three
worlds, say \(w_a\), \(w_b\), and \(w_c\), where in world \(w_x\), prisoner x is pardoned. But
this representation of a world does not take into account what the jailer
says. A better representation of a possible situation is as a pair (x, y), where
\(x, y \in \{a, b, c\}\). Intuitively, a pair (x, y) represents a situation where x is
pardoned and the jailer says that y will be executed in response to a's question.
Since the jailer answers truthfully, \(x \ne y\); since the jailer will never tell a
directly that a will be executed, \(y \ne a\). Thus, the set of possible worlds is
{(a, b), (a, c), (b, c), (c, b)}. The event lives-a (a lives) corresponds to the set
{(a, b), (a, c)}. Similarly, the events lives-b and lives-c correspond to the sets
{(b, c)} and {(c, b)}, respectively. By assumption, each prisoner is equally
likely to be pardoned, so that each of these three events has probability 1/3.

The event says-b (the jailer says b) corresponds to the set {(a, b), (c, b)};
the story does not give a probability for this event. The event {(c, b)}
(lives-c) has probability 1/3. But what is the probability of {(a, b)}? That depends
on the jailer's strategy in the one case where he has a choice, namely, when a
lives. He gets to choose between saying b and c in that case. The probability
of (a, b) depends on the probability that he says b if a lives; that is, on
\(\mu(\textit{says-b} \mid \textit{lives-a})\).

If the jailer chooses at random between saying b and c if a is pardoned,
so that \(\mu(\textit{says-b} \mid \textit{lives-a}) = 1/2\), then \(\mu(\{(a,b)\}) = \mu(\{(a,c)\}) = 1/6\), and
\(\mu(\textit{says-b}) = 1/2\). With this assumption,

\[ \mu(\textit{lives-a} \mid \textit{says-b}) = \mu(\textit{lives-a} \cap \textit{says-b}) / \mu(\textit{says-b}) = (1/6)/(1/2) = 1/3. \]

Thus, if \(\mu(\textit{says-b}) = 1/2\), the jailer's answer does not affect a's probability.

Suppose more generally that \(\mu_\alpha\), \(0 \le \alpha \le 1\), is the probability measure
such that \(\mu_\alpha(\textit{lives-a}) = \mu_\alpha(\textit{lives-b}) = \mu_\alpha(\textit{lives-c}) = 1/3\) and
\(\mu_\alpha(\textit{says-b} \mid \textit{lives-a}) = \alpha\). Then straightforward computations show that

\[
\begin{aligned}
\mu_\alpha(\{(a,b)\}) &= \mu_\alpha(\textit{lives-a}) \times \mu_\alpha(\textit{says-b} \mid \textit{lives-a}) = \alpha/3,\\
\mu_\alpha(\textit{says-b}) &= \mu_\alpha(\{(a,b)\}) + \mu_\alpha(\{(c,b)\}) = (\alpha + 1)/3, \text{ and}\\
\mu_\alpha(\textit{lives-a} \mid \textit{says-b}) &= \frac{\alpha/3}{(\alpha+1)/3} = \alpha/(\alpha + 1).
\end{aligned}
\]

Thus, \(\mu_{1/2} = \mu\). Moreover, if \(\alpha \ne 1/2\) (i.e., if the jailer had a particular
preference for answering either b or c when a was the one pardoned), then
a's probability of being executed would change, depending on the answer.
For example, if \(\alpha = 0\), then if a is pardoned, the jailer will definitely say
c. Thus, if the jailer actually says b, then a knows that he is definitely not
pardoned, that is, \(\mu_0(\textit{lives-a} \mid \textit{says-b}) = 0\). Similarly, if \(\alpha = 1\), then a knows
that if either he or c is pardoned, then the jailer will say b, while if b is
pardoned the jailer will say c. Given that the jailer says b, from a's point
of view the one pardoned is equally likely to be him or c; thus,
\(\mu_1(\textit{lives-a} \mid \textit{says-b}) = 1/2\). In fact, it is easy to see that if \(\mathcal{P}_J = \{\mu_\alpha : \alpha \in [0, 1]\}\), then
\((\mathcal{P}_J|\textit{says-b})_*(\textit{lives-a}) = 0\) and \((\mathcal{P}_J|\textit{says-b})^*(\textit{lives-a}) = 1/2\).

To summarize, the intuitive answer — that the jailer's answer gives a no 
information — is correct if the jailer applies the principle of indifference in the 
one case where he has a choice in what to say, namely, when a is actually 
the one to live. If the jailer does not apply the principle of indifference in 
this case, then a may gain information. On the other hand, if a does not 
know what strategy the jailer is using to answer (and is not willing to place a 
probability on these strategies), then his prior point probability of 1/3 dilates 
to the interval [0,1/2].
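The dependence of a's posterior on the jailer's strategy can be tabulated directly from the four-world model. The sketch below assumes a finite grid of strategies (the probability that the jailer says b when a is pardoned, in steps of 1/100) as a stand-in for the whole interval [0, 1]; the function name `posterior_lives_a` is illustrative.

```python
from fractions import Fraction

def posterior_lives_a(alpha):
    """Probability that a is pardoned given that the jailer says b, when the
    jailer says b with probability alpha in the case that a is pardoned."""
    third = Fraction(1, 3)
    # Worlds are pairs (pardoned prisoner, prisoner named by the jailer).
    joint = {
        ("a", "b"): third * alpha,
        ("a", "c"): third * (1 - alpha),
        ("b", "c"): third,
        ("c", "b"): third,
    }
    says_b = sum(p for (_, named), p in joint.items() if named == "b")
    return joint[("a", "b")] / says_b

alphas = [Fraction(k, 100) for k in range(101)]
posteriors = [posterior_lives_a(alpha) for alpha in alphas]
print(min(posteriors), posterior_lives_a(Fraction(1, 2)), max(posteriors))
```

The smallest and largest posteriors occur at the two deterministic strategies, and the indifferent strategy leaves the prior of 1/3 unchanged.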

As Seidenfeld and Wasserman [1993] have shown, the dilation phenomenon
observed in this example, where the prisoner's ignorance after hearing the
jailer's answer goes from 0 (initially a knew that the probability of his being
pardoned was 1/3) to 1/2, no matter what the jailer says, is quite general.
Nevertheless, it is easy to see where the dilation is coming from here, and
it is arguably acceptable. (Although, as shown by Grünwald and Halpern
[2004], there may be circumstances when working with sets of probabilities
under which it is most appropriate to ignore new information and just work
with the prior probability.) A perhaps more significant problem with this
approach to conditioning on sets of probabilities is that it does not always
seem to capture learning, as the following example shows.

Example 5.3: Suppose that a coin is tossed twice and the first coin toss
is observed to land heads. What is the likelihood that the second coin toss
lands heads? In this situation, the sample space consists of four worlds:
hh, ht, th, and tt. Let \(H^1 = \{hh, ht\}\) be the event that the first coin toss
lands heads. There are analogous events \(H^2\), \(T^1\), and \(T^2\). Further suppose
that all that is known about the coin is that its bias is either \(\alpha\) or \(\beta\), where
\(0 \le \alpha < \beta \le 1\). The most obvious way to represent this seems to be with
a set of probability measures \(\mathcal{P} = \{\mu_\alpha, \mu_\beta\}\).³ Further suppose that the coin
tosses are independent, so that, in particular, \(\mu_\gamma(hh) = \mu_\gamma(H^1)\mu_\gamma(H^2) = \gamma^2\)
and that \(\mu_\gamma(ht) = \mu_\gamma(H^1)\mu_\gamma(T^2) = \gamma - \gamma^2\) for \(\gamma \in \{\alpha, \beta\}\).

³ Some researchers working with probability restrict to sets \(\mathcal{P}\) of probability measures
that are convex. That is, if \(\mu\) and \(\mu'\) are both in \(\mathcal{P}\), then so is the probability measure
\(a\mu + (1-a)\mu'\) for all a in the interval [0, 1] (where \((a\mu + (1-a)\mu')(U) = a\mu(U) + (1-a)\mu'(U)\); it
is easy to check that \(a\mu + (1-a)\mu'\) is a probability measure). I do not make this restriction
here, but it is worth noting that nothing would be lost in this example by taking \(\mathcal{P}\) to be
the convex set consisting of all probability measures \(\mu\) such that \(\alpha \le \mu(h) \le \beta\).

Using the definitions, it is immediate that \((\mathcal{P}|H^1)(H^2) = \{\alpha, \beta\} = \mathcal{P}(H^2)\).
At first blush, this seems reasonable. Since the coin tosses are independent,
observing heads on the first toss does not affect the likelihood of heads on
the second toss; it is either \(\alpha\) or \(\beta\), depending on what the actual bias of the
coin is. However, intuitively, observing heads on the first toss should also
give information about the coin being used: it is more likely to be the coin
with bias \(\beta\). This point perhaps comes out more clearly if \(\alpha = 1/3\), \(\beta = 2/3\),
the coin is tossed 100 times, and 66 heads are observed in the first 99 tosses.
What is the probability of heads on the hundredth toss? Formally, using the
obvious notation, the question now is what \((\mathcal{P}|H^1 \cap \ldots \cap H^{99})(H^{100})\) should
be. According to the definitions, it is again {1/3, 2/3}: the probability is
still either 1/3 or 2/3, depending on the coin used. But the fact that 66 of
99 tosses landed heads provides extremely strong evidence that the coin has
bias 2/3 rather than 1/3. This evidence should make it more likely that the
probability that the last coin will land heads is 2/3 rather than 1/3. The
conditioning process does not capture this evidence at all.

The inability of this approach to conditioning with sets of probabilities to
capture learning is perhaps its most serious weakness. Note that this really
is a problem confined to sets of probabilities. If there is a probability on the
possible biases of the coin, then all these difficulties disappear. In this case,
the sample space must represent the possible biases of the coin, so there are
eight worlds: \((\alpha, hh)\), \((\beta, hh)\), \((\alpha, ht)\), \((\beta, ht)\), .... Moreover, if the probability
that it has bias \(\alpha\) is p (so that the probability that it has bias \(\beta\) is 1 - p),
then the uncertainty is captured by a single probability measure \(\mu\) such that
\(\mu(\alpha, hh) = p\alpha^2\), \(\mu(\beta, hh) = (1-p)\beta^2\), and so on. A straightforward calculation
shows that \(\mu(H^1) = \mu(H^2) = p\alpha + (1-p)\beta\) and \(\mu(H^1 \cap H^2) = p\alpha^2 + (1-p)\beta^2\),
so \(\mu(H^2 \mid H^1) = (p\alpha^2 + (1-p)\beta^2)/(p\alpha + (1-p)\beta)\). With a little calculus, it
can be shown that \(\mu(H^2 \mid H^1) = (p\alpha^2 + (1-p)\beta^2)/(p\alpha + (1-p)\beta) \ge \mu(H^2)\),
no matter what \(\alpha\) and \(\beta\) are, with equality holding iff p = 0 or p = 1.

Intuitively, seeing \(H^1\) makes \(H^2\) more likely than it was before, despite the
fact that the coin tosses are independent, because seeing \(H^1\) makes the coin
that is more biased towards heads more likely to be the actual coin. This intuition can be
formalized in a straightforward way. Let \(C_\beta\) be the event that the coin has bias
\(\beta\) (so that \(C_\beta\) consists of the four worlds of the form \((\beta, \ldots)\)). Then
\(\mu(C_\beta) = 1 - p\) by assumption, while \(\mu(C_\beta \mid H^1) = (1-p)\beta/(p\alpha + (1-p)\beta) \ge 1 - p\), with
equality holding iff p is either 0 or 1 (since otherwise \(\beta/(p\alpha + (1-p)\beta) > 1\)).
Similarly, \(\mu(H^2 \mid H^1) \ge \mu(H^2)\), with equality holding iff p is either 0 or 1.
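The contrast with a single prior over the two biases can be made concrete. The following sketch assumes a prior probability p for the smaller bias and applies Bayes' rule to a sequence with a given number of heads and tails; the function names are illustrative, and the numbers reuse the biases 1/3 and 2/3 from Example 5.3.

```python
from fractions import Fraction

def posterior_larger_bias(p, alpha, beta, heads, tails):
    """Posterior probability that the coin has bias beta, given a prior
    probability p for bias alpha and an observed sequence with the stated
    numbers of heads and tails (independent tosses)."""
    weight_alpha = p * alpha**heads * (1 - alpha)**tails
    weight_beta = (1 - p) * beta**heads * (1 - beta)**tails
    return weight_beta / (weight_alpha + weight_beta)

def prob_next_heads(p, alpha, beta, heads, tails):
    """Probability that the next toss lands heads, given the observations."""
    w = posterior_larger_bias(p, alpha, beta, heads, tails)
    return (1 - w) * alpha + w * beta

p, alpha, beta = Fraction(1, 2), Fraction(1, 3), Fraction(2, 3)
print(prob_next_heads(p, alpha, beta, 0, 0))   # before any observation
print(prob_next_heads(p, alpha, beta, 1, 0))   # after one head
print(float(posterior_larger_bias(p, alpha, beta, 66, 33)))  # after 66 heads in 99
```

With these numbers, one observed head raises the probability of heads on the next toss from 1/2 to 5/9, in agreement with the formula for the conditional probability of the second head given the first, and 66 heads in 99 tosses make the larger bias all but certain.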

Interestingly, if the bias of the coin is either 0 or 1 (i.e., the coin is either
double-tailed or double-headed, so that \(\alpha = 0\) and \(\beta = 1\)), then the evidence
is taken into account. In this case, after seeing heads, \(\mu_0\) is eliminated, so
\((\mathcal{P}|H^1)(H^2) = 1\) (or, more precisely, {1}), not {0, 1}. On the other hand, if the
bias is almost 0 or almost 1, say .005 or .995, then \((\mathcal{P}|H^1)(H^2) = \{.005, .995\}\).
Thus, although the evidence is taken into account in the extreme case, where
the probability of heads is either 0 or 1, it is not taken into account if the
probability of heads is either slightly greater than 0 or slightly less than 1.

This observation suggests a modification of the conditioning process that
lets us capture learning. In Example 5.3, the implicit assumption is that
there is a true bias of the coin, either \(\alpha\) or \(\beta\), which the agent would like
to learn. Given an observation, the maximum likelihood approach, which
is standard in statistics, would essentially use the probability measure that
gave the highest probability to the observation from then on. Since \(\alpha < \beta\)
by assumption, after observing heads, we would use \(\mu_\beta\) for making future
predictions, while after observing tails, we would use \(\mu_\alpha\).

The conditioning approach considered so far uses all probability measures
except those that give probability 0 to the observation. An intermediate
approach between these extremes is to consider only probability distributions
that are within some parameter q of the maximum probability that U gets.
Formally, for \(0 \le q \le 1\), define

\[ \mathcal{P}^q|U = \{\mu|U : \mu \in \mathcal{P},\ q\,\mathcal{P}^*(U) \le \mu(U)\}. \]

The maximum likelihood approach is a special case of this approach with
q = 1. \(\mathcal{P}|U\), as defined earlier, is essentially the case where q = 0, except that
\(\le\) is replaced by <.

Intuitively, q can be viewed as describing how "conservative" the agent is;
the smaller q is, the more conservative the agent. Note that, for any choice
of q > 0, learning takes place. For example, if we take \(\mathcal{P}\) to consist of all the
probability measures \(\mu_\alpha\) with \(\alpha \in [1/3, 2/3]\) (so that the agent considers the
bias of the coin to be somewhere between 1/3 and 2/3), and the true bias is
\(\beta \in [1/3, 2/3]\), then for any choice of q > 0 and \(\epsilon\), the agent will (with extremely
high probability) converge to considering possible only distributions \(\mu_\gamma\) with
\(\gamma \in [\beta - \epsilon, \beta + \epsilon]\). The larger q is, the faster the learning (but the greater
the likelihood of making mistakes by perhaps ignoring a probability measure
inappropriately).⁴
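The effect of the parameter q can be illustrated on the coin example. The sketch below assumes a finite grid of biases standing in for the interval [1/3, 2/3] and a single observed sequence with 66 heads and 33 tails; it returns the biases whose measures survive the update, rather than the conditioned measures themselves. The function name `q_condition` and the grid are illustrative.

```python
from fractions import Fraction

def q_condition(biases, heads, tails, q):
    """Return the biases g in `biases` whose measure assigns the observed
    sequence (with the given numbers of heads and tails) probability at
    least q times the largest probability assigned by any bias in the set."""
    likelihood = {g: g**heads * (1 - g)**tails for g in biases}
    best = max(likelihood.values())
    return [g for g in biases if likelihood[g] >= q * best]

# Hypothetical finite grid standing in for the interval [1/3, 2/3].
grid = [Fraction(k, 300) for k in range(100, 201)]

all_kept = q_condition(grid, 66, 33, 0)               # q = 0: nothing is discarded
half = q_condition(grid, 66, 33, Fraction(1, 2))      # intermediate choice of q
ml_only = q_condition(grid, 66, 33, 1)                # q = 1: maximum likelihood
print(len(all_kept), float(min(half)), float(max(half)), ml_only)
```

On this grid, q = 0 retains every bias, q = 1 retains only the maximum-likelihood bias 2/3, and q = 1/2 retains a narrow band of biases just below 2/3, which is the sense in which larger q means faster but less cautious learning.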

6 Lower and Upper Expectation 

In the context of probability and betting games, how much an agent can 
expect to win is defined in terms of expectation. 

A gamble \(X\) on \(W\) is a function from \(W\) to the reals.⁵ As is standard in
the literature, if \(x\) is a real number, take \(X = x\) to be the subset of \(W\) which
\(X\) maps to \(x\); that is, \(X = x\) is the subset \(\{w : X(w) = x\}\).

4 Although the idea of using a parameter q to do the updating is quite natural, I have 
seen it in print only in the work of Epstein and Schneider [2005] , who use it in the context 
of decision making. 

5 A gamble is just a random variable whose range is the reals.

The expected value of \(X\) with respect to probability measure \(\mu\), denoted
\(E_\mu(X)\), is just

\[E_\mu(X) = \sum_{x} x\,\mu(X = x),\]

where the sum ranges over the (finitely many) values that \(X\) takes.

For example, suppose that the agent bets $1 and will win $5 if \(U\) happens
and lose his dollar if \(U\) does not happen. We can characterize this bet by the
gamble \(B = 5X_U - X_{\overline{U}}\), where, for an arbitrary subset \(V\) of \(W\), \(X_V(w) = 1\)
if \(w \in V\) and \(X_V(w) = 0\) if \(w \notin V\). (\(X_V\) is called the indicator function for
\(V\).)

If \(\mu(U) = 1/3\), then the agent expects to win $5 with probability 1/3,
and to lose $1 with probability 2/3. The expected value of this bet is

\[E_\mu(B) = \frac{1}{3} \times 5 + \frac{2}{3} \times (-1) = 1.\]

This seems like an intuitively reasonable characterization of the agent's
expected winnings, provided that his uncertainty is given by the probability
measure \(\mu\).
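
The computation can be checked mechanically. The sketch below assumes a hypothetical three-world space in which U is a single world of probability 1/3; it sums over worlds rather than over the values of the gamble, which gives the same number when W is finite. The helper names are illustrative only.

```python
from fractions import Fraction

def expectation(mu, gamble):
    """Expected value of a gamble: sum over worlds w of mu(w) * X(w)."""
    return sum(p * gamble(w) for w, p in mu.items())

def indicator(event):
    """Indicator function of an event (a set of worlds)."""
    return lambda w: 1 if w in event else 0

# Hypothetical space: three equally likely worlds, U = {"w1"}.
mu = {"w1": Fraction(1, 3), "w2": Fraction(1, 3), "w3": Fraction(1, 3)}
U = {"w1"}
not_U = set(mu) - U

X_U, X_not_U = indicator(U), indicator(not_U)

def B(w):
    """The bet: win 5 if U happens, lose 1 otherwise."""
    return 5 * X_U(w) - X_not_U(w)

print(expectation(mu, B))  # 1
```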

Probabilistic expectation is characterized by some well-known properties.
To make them precise, if \(X\) and \(Y\) are gambles on \(W\) and \(a\) and \(b\) are
real numbers, define the gamble \(aX + bY\) on \(W\) in the obvious way:
\((aX + bY)(w) = aX(w) + bY(w)\). Say that \(X \le Y\) if \(X(w) \le Y(w)\) for all \(w \in W\).
Let \(\tilde{c}\) denote the constant function that always returns \(c\); that is, \(\tilde{c}(w) = c\).

Proposition 6.1: The function \(E_\mu\) has the following properties for all
gambles \(X\) and \(Y\).

(a) \(E_\mu\) is additive: \(E_\mu(X + Y) = E_\mu(X) + E_\mu(Y)\).

(b) \(E_\mu\) is affinely homogeneous: \(E_\mu(aX + \tilde{b}) = aE_\mu(X) + b\) for all \(a, b \in \mathbb{R}\).

(c) \(E_\mu\) is monotone: if \(X \le Y\), then \(E_\mu(X) \le E_\mu(Y)\).

The properties in Proposition 6.1 essentially characterize probabilistic 
expectation. 

Proposition 6.2: Suppose that \(E\) maps gambles on \(W\) to \(\mathbb{R}\) and \(E\) is
additive, affinely homogeneous, and monotone. Then there is a (necessarily
unique) probability measure \(\mu\) on \(W\) such that \(E = E_\mu\).

Now suppose that uncertainty is represented by a set \(\mathcal{P}\) of probability
measures, rather than a single probability measure. Define
\(E_{\mathcal{P}}(X) = \{E_\mu(X) : \mu \in \mathcal{P}\}\). \(E_{\mathcal{P}}(X)\) is a set of numbers. We can use \(E_{\mathcal{P}}\) to define
obvious analogues of lower and upper probability. Define the lower expectation
and upper expectation of \(X\) with respect to \(\mathcal{P}\), denoted \(\underline{E}_{\mathcal{P}}(X)\) and \(\overline{E}_{\mathcal{P}}(X)\), as
the inf and sup of the set \(E_{\mathcal{P}}(X)\), respectively.
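
For a finite set of measures, the inf and sup are simply a minimum and a maximum, so the definition can be tried out directly. The sketch below assumes a hypothetical coin whose bias is known only to be 1/3 or 2/3; the function names are illustrative.

```python
from fractions import Fraction

def expectation(mu, gamble):
    """Expected value of a gamble under a single measure mu."""
    return sum(p * gamble(w) for w, p in mu.items())

def lower_expectation(measures, gamble):
    """Smallest expected value over a finite set of measures."""
    return min(expectation(mu, gamble) for mu in measures)

def upper_expectation(measures, gamble):
    """Largest expected value over a finite set of measures."""
    return max(expectation(mu, gamble) for mu in measures)

# Hypothetical set: the coin's bias is either 1/3 or 2/3.
P = [
    {"h": Fraction(1, 3), "t": Fraction(2, 3)},
    {"h": Fraction(2, 3), "t": Fraction(1, 3)},
]

def X(w):
    """Win 5 on heads, lose 1 on tails."""
    return 5 if w == "h" else -1

def X_heads(w):
    """Indicator function of the event {h}."""
    return 1 if w == "h" else 0

print(lower_expectation(P, X), upper_expectation(P, X))              # 1 3
print(lower_expectation(P, X_heads), upper_expectation(P, X_heads))  # 1/3 2/3
```

Applied to an indicator function, the two functions return the lower and upper probability of the corresponding event.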

Just as lower probability determines upper probability (and vice versa),
so lower expectation determines upper expectation. It is not hard to show
that

\[\overline{E}_{\mathcal{P}}(X) = -\underline{E}_{\mathcal{P}}(-X).\]

We can recover lower and upper probability from lower and upper
expectation. It is easy to check that \(\underline{E}_{\mathcal{P}}(X_U) = \mathcal{P}_*(U)\) and \(\overline{E}_{\mathcal{P}}(X_U) = \mathcal{P}^*(U)\),
where \(X_U\) is the indicator function for \(U\) defined earlier. The converse is
not true; lower and upper probability do not determine lower and upper
expectation.

Example 6.3: Again, consider the sets \(\mathcal{P}_3\) and \(\mathcal{P}_4\) of probability measures
defined in Example 3.2. As observed earlier, \((\mathcal{P}_3)_* = (\mathcal{P}_4)_*\), and so
\((\mathcal{P}_3)^* = (\mathcal{P}_4)^*\). However, if \(Y\) is the random variable \(X_{\{b\}} - X_{\{y\}}\), then
\(\underline{E}_{\mathcal{P}_4}(Y) = \overline{E}_{\mathcal{P}_4}(Y) = 0\) (since \(\mu(b) = \mu(y)\) for all probability measures in \(\mathcal{P}_4\)), while

\[\underline{E}_{\mathcal{P}_3}(Y) = -1 \quad \text{and} \quad \overline{E}_{\mathcal{P}_3}(Y) = 1.\]

Thus, lower (and upper) expectation can make finer distinctions than
lower and upper probability. (Note that this is not the case for probability: \(\mu\)
determines \(E_\mu\) and vice versa.) Moreover, the lower expectation corresponding
to a set \(\mathcal{P}\) of probability measures essentially determines \(\mathcal{P}\).

To make this precise, recall that a set \(\mathcal{P}\) of probability measures on \(W\)
is convex if, for all \(\mu, \mu' \in \mathcal{P}\) and \(a \in [0, 1]\), the probability measure
\(a\mu + (1 - a)\mu'\) is also in \(\mathcal{P}\). \(\mathcal{P}\) is closed if it contains its limits. That is, for all
sequences \(\mu_1, \mu_2, \ldots\) of probability measures in \(\mathcal{P}\), if \(\mu_n \to \mu\) in the sense that
\(\mu_n(U) \to \mu(U)\) for all \(U \subseteq W\), then \(\mu \in \mathcal{P}\). Let \(\overline{\mathcal{P}}\) denote the convex closure
of \(\mathcal{P}\); that is, \(\overline{\mathcal{P}}\) is the smallest closed convex set of probability measures
containing \(\mathcal{P}\). It is easy to see that \(\underline{E}_{\mathcal{P}} = \underline{E}_{\overline{\mathcal{P}}}\) and \(\overline{E}_{\mathcal{P}} = \overline{E}_{\overline{\mathcal{P}}}\); adding
convex combinations of probability measures to \(\mathcal{P}\) does not affect the lower
expectation, nor does closing off \(\mathcal{P}\) under limits. The converse holds as well.

Theorem 6.4: \(\underline{E}_{\mathcal{P}_1} = \underline{E}_{\mathcal{P}_2}\) iff \(\overline{\mathcal{P}_1} = \overline{\mathcal{P}_2}\).

Thus, there is a one-to-one map between closed, convex sets of probability
measures and lower expectation functions. This shows that lower expectations
are essentially as good as sets of probability measures as representations
of uncertainty. Walley [1991] provides a detailed account of the use of lower
and upper expectations as a representation of uncertainty. (He calls them
coherent lower and upper previsions.)
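
The claim that adding convex combinations does not change the lower expectation can be checked on a small case. The sketch below is an illustration only: the two measures, the mixing weights and the test gambles are hypothetical, and closure under limits is not exercised.

```python
from fractions import Fraction
from itertools import product

WORLDS = ("r", "b", "y")

def expectation(mu, gamble):
    """Expected value of a gamble (a dict from worlds to reals) under mu."""
    return sum(mu[w] * gamble[w] for w in WORLDS)

def lower_expectation(measures, gamble):
    """Smallest expected value over a finite set of measures."""
    return min(expectation(mu, gamble) for mu in measures)

def mix(mu1, mu2, a):
    """Convex combination a*mu1 + (1 - a)*mu2."""
    return {w: a * mu1[w] + (1 - a) * mu2[w] for w in WORLDS}

# Two hypothetical measures on a three-world space.
mu1 = {"r": Fraction(3, 10), "b": Fraction(7, 10), "y": Fraction(0)}
mu2 = {"r": Fraction(3, 10), "b": Fraction(0), "y": Fraction(7, 10)}

P = [mu1, mu2]
P_mixed = P + [mix(mu1, mu2, Fraction(k, 10)) for k in range(1, 10)]

# All gambles taking values in {-1, 0, 2} on the three worlds.
gambles = [dict(zip(WORLDS, vals)) for vals in product((-1, 0, 2), repeat=3)]

unchanged = all(
    lower_expectation(P, X) == lower_expectation(P_mixed, X) for X in gambles
)
print(unchanged)  # True
```

The result is expected because expectation is linear in the measure, so the expected value under a mixture always lies between the expected values under the two measures being mixed.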

Lower and upper expectation have a rather elegant characterization, sim- 
ilar in spirit to (but simpler than) the characterization of lower and upper 
probability. The following result collects some properties of lower and upper 
expectation, all of which are easy to verify. 

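Proposition 6.5: The functions \(\overline{E}_{\mathcal{P}}\) and \(\underline{E}_{\mathcal{P}}\) have the following properties,
for all gambles \(X\) and \(Y\).

(a) \(\overline{E}_{\mathcal{P}}\) is subadditive: \(\overline{E}_{\mathcal{P}}(X + Y) \le \overline{E}_{\mathcal{P}}(X) + \overline{E}_{\mathcal{P}}(Y)\);
\(\underline{E}_{\mathcal{P}}\) is superadditive: \(\underline{E}_{\mathcal{P}}(X + Y) \ge \underline{E}_{\mathcal{P}}(X) + \underline{E}_{\mathcal{P}}(Y)\).

(b) \(\underline{E}_{\mathcal{P}}\) and \(\overline{E}_{\mathcal{P}}\) are both positively affinely homogeneous:
\(\underline{E}_{\mathcal{P}}(aX + \tilde{b}) = a\underline{E}_{\mathcal{P}}(X) + b\) and \(\overline{E}_{\mathcal{P}}(aX + \tilde{b}) = a\overline{E}_{\mathcal{P}}(X) + b\) if \(a, b \in \mathbb{R}\), \(a \ge 0\).

(c) \(\underline{E}_{\mathcal{P}}\) and \(\overline{E}_{\mathcal{P}}\) are monotone.

(d) \(\overline{E}_{\mathcal{P}}(X) = -\underline{E}_{\mathcal{P}}(-X)\).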
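
A small numerical case shows that the inequalities in part (a) can be strict. The sketch assumes a hypothetical coin whose bias is either 1/3 or 2/3 and takes X and Y to be the indicator functions of heads and tails; all names are illustrative.

```python
from fractions import Fraction

def expectation(mu, gamble):
    """Expected value of a gamble under a single measure mu."""
    return sum(p * gamble(w) for w, p in mu.items())

def lower(measures, gamble):
    """Lower expectation over a finite set of measures."""
    return min(expectation(mu, gamble) for mu in measures)

def upper(measures, gamble):
    """Upper expectation over a finite set of measures."""
    return max(expectation(mu, gamble) for mu in measures)

# Hypothetical set: the coin's bias is either 1/3 or 2/3.
P = [
    {"h": Fraction(1, 3), "t": Fraction(2, 3)},
    {"h": Fraction(2, 3), "t": Fraction(1, 3)},
]

def X(w):
    return 1 if w == "h" else 0

def Y(w):
    return 1 if w == "t" else 0

def X_plus_Y(w):
    return X(w) + Y(w)

# Superadditivity of the lower expectation, strict here: 1 versus 2/3.
print(lower(P, X_plus_Y), lower(P, X) + lower(P, Y))  # 1 2/3
# Subadditivity of the upper expectation, strict here: 1 versus 4/3.
print(upper(P, X_plus_Y), upper(P, X) + upper(P, Y))  # 1 4/3
```

Because X + Y is the constant gamble 1, every measure gives it expected value 1, while the worst cases for X and for Y occur under different measures.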

Superadditivity (resp., subadditivity), positive affine homogeneity, and
monotonicity in fact characterize \(\underline{E}_{\mathcal{P}}\) (resp., \(\overline{E}_{\mathcal{P}}\)).

Theorem 6.6: [Huber 1981] Suppose that \(E\) maps gambles on \(W\) to \(\mathbb{R}\)
and is superadditive (resp., subadditive), positively affinely homogeneous, and
monotone. Then there is a set \(\mathcal{P}\) of probability measures on \(W\) such that
\(E = \underline{E}_{\mathcal{P}}\) (resp., \(E = \overline{E}_{\mathcal{P}}\)).

The set \(\mathcal{P}\) constructed in Theorem 6.6 is not unique. But it follows from
Theorem 6.4 that there is a unique closed convex set \(\mathcal{P}\) such that \(E = \underline{E}_{\mathcal{P}}\).
\(\mathcal{P}\) is actually the largest set of probability measures \(\mathcal{P}'\) such that \(E = \underline{E}_{\mathcal{P}'}\),
and consists of all probability measures \(\mu\) such that \(E_\mu(X) \ge E(X)\) for all
gambles \(X\).

7 Decision Making



One of the standard uses of a representation of uncertainty is to help make 
decisions. Savage [1954] formalizes the decision process by considering a set 
W of possible worlds (sometimes called states), a set C of consequences, 
and a set A of acts, which are functions from worlds to consequences. For 
example, if an agent is trying to decide how to bet on a horse race, the 
worlds could represent the order in which the horses finished the race, and 
the consequences could be amounts of money won or lost. The consequence 
of a bet of $10 on Northern Dancer depends on how Northern Dancer finishes 
in the world. So the bet is an act that maps worlds (which describe possible 
orders of finish) to consequences. The consequence could be purely monetary 
(the agent wins $50 in the worlds where Northern Dancer wins the race) but 
could also include feelings (the agent is dejected if Northern Dancer finishes 
last, and he also loses $10). 

Savage [1954] assumes that the agent has a preference order \(\succeq\) on acts,
where \(a_1 \succeq a_2\) means that \(a_1\) is at least as good as \(a_2\) from the point of
view of the agent. He shows that if the preference order satisfies certain
postulates, then the agent is acting as if she has a probability \(\mu\) on worlds, a
utility function \(u\) mapping consequences to reals, and is maximizing expected
utility; that is, \(a_1 \succeq a_2\) iff the expected utility of \(a_1\) is at least as high as the
expected utility of \(a_2\).

Savage viewed his postulates as rationality postulates; an agent would be
irrational if her preferences violated the postulates. However, as I discussed
earlier, in the situation described by Example 1.2, experimental evidence
(see [Kagel and Roth 1995]) shows that most people prefer the bet \(B_r\) to \(B_b\)
and also prefer \(B_{by}\) to \(B_{ry}\). These preferences are inconsistent with Savage's
postulates. Indeed, there does not exist a utility that can be placed on the
two possible consequences (getting $1 and getting $0) and a probability that
can be placed on \(\{b, r, y\}\) such that these preferences correspond to the order
induced by expected utility.

On the other hand, these preferences can be captured using lower expected
utility, an approach considered by Wald [1950], Gärdenfors and Sahlin
[1982], and Gilboa and Schmeidler [1989], among others. Taking the obvious
set \(\mathcal{P}_u\) of probability measures described after Example 1.2 and giving utility
1 to winning $1 and utility 0 to getting $0, it is easy to see that the lower
expected utility of act \(B_r\) is .3, the lower expected utility of act \(B_b\) is 0, the
lower expected utility of \(B_{ry}\) is also .3, and the lower expected utility of \(B_{by}\)
is .7. Thus, if the agent prefers the act whose lower expected utility is larger,
then she would indeed prefer \(B_r\) to \(B_b\) and prefer \(B_{by}\) to \(B_{ry}\).
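
These four numbers can be reproduced with a short computation. The sketch rests on an assumption about the set of measures, which is defined earlier in the chapter and not restated here: red has probability .3, and the remaining .7 is split between blue and yellow in an unknown way (approximated below by a grid that includes both extremes).

```python
from fractions import Fraction

# Assumed set of measures: mu(r) = 3/10, mu(b) = k/100, mu(y) = (70 - k)/100.
P_u = [
    {"r": Fraction(3, 10), "b": Fraction(k, 100), "y": Fraction(70 - k, 100)}
    for k in range(0, 71)
]

# Each bet pays $1 (utility 1) on the listed colours and $0 (utility 0) otherwise.
acts = {
    "B_r": {"r"},
    "B_b": {"b"},
    "B_ry": {"r", "y"},
    "B_by": {"b", "y"},
}

def lower_expected_utility(measures, winning_colours):
    """With utilities 0 and 1, expected utility is the probability of winning."""
    return min(sum(mu[c] for c in winning_colours) for mu in measures)

lower = {name: lower_expected_utility(P_u, win) for name, win in acts.items()}
for name, value in lower.items():
    print(name, value)
```

Since expected utility is linear in the probability of blue, the minimum is always attained at one of the two extreme measures, so the grid gives the same lower values as the full interval.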

Gilboa and Schmeidler [1989] provide a collection of postulates that char- 
acterize decision making with lower expected utility in the spirit of Savage's 
postulates. Of course, it is debatable whether these postulates represent
"rationality" any better than Savage's do. However, they do undercut the claim
that Savage's postulates characterize rationality.

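Using lower expected utility corresponds to the preference order \(\succeq^1_{\mathcal{P}}\) on
acts such that \(a \succeq^1_{\mathcal{P}} a'\) iff \(\underline{E}_{\mathcal{P}}(u_a) \ge \underline{E}_{\mathcal{P}}(u_{a'})\), where \(u_a\) denotes the gamble
that maps each world \(w\) to the utility \(u(a(w))\). But this is not the only
preference rule that can be used if uncertainty is represented using a set \(\mathcal{P}\)
of probabilities. Other orders can be defined as well:

• \(a \succeq^2_{\mathcal{P}} a'\) iff \(\overline{E}_{\mathcal{P}}(u_a) \ge \overline{E}_{\mathcal{P}}(u_{a'})\);

• \(a \succeq^3_{\mathcal{P}} a'\) iff \(\underline{E}_{\mathcal{P}}(u_a) \ge \overline{E}_{\mathcal{P}}(u_{a'})\);

• \(a \succeq^4_{\mathcal{P}} a'\) iff \(E_\mu(u_a) \ge E_\mu(u_{a'})\) for all \(\mu \in \mathcal{P}\).

Of course, all of these preference orders reduce to the order provided by
maximizing expected utility if \(\mathcal{P}\) is a singleton. But in general they are quite
different. The order on acts induced by \(\succeq^3_{\mathcal{P}}\) is very conservative; \(a \succeq^3_{\mathcal{P}} a'\)
iff the worst expected outcome according to \(a\) is at least as good as the best
expected outcome according to \(a'\). The order induced by \(\succeq^4_{\mathcal{P}}\) is more refined.
Clearly if \(a \succeq^3_{\mathcal{P}} a'\), then \(E_\mu(u_a) \ge E_\mu(u_{a'})\) for all \(\mu \in \mathcal{P}\), so \(a \succeq^4_{\mathcal{P}} a'\). The
converse may not hold. For example, suppose that \(\mathcal{P} = \{\mu, \mu'\}\), and acts \(a\)
and \(a'\) are such that \(E_\mu(u_a) = 2\), \(E_{\mu'}(u_a) = 4\), \(E_\mu(u_{a'}) = 1\), and \(E_{\mu'}(u_{a'}) = 3\).
Then \(\underline{E}_{\mathcal{P}}(u_a) = 2\), \(\overline{E}_{\mathcal{P}}(u_a) = 4\), \(\underline{E}_{\mathcal{P}}(u_{a'}) = 1\), and \(\overline{E}_{\mathcal{P}}(u_{a'}) = 3\), so \(a\) and \(a'\)
are incomparable according to \(\succeq^3_{\mathcal{P}}\), yet \(a \succeq^4_{\mathcal{P}} a'\).

Which of these rules is the "right" one? We can think of \(\succeq^1_{\mathcal{P}}\) as
representing a very pessimistic agent (who considers only the worst case); \(\succeq^2_{\mathcal{P}}\)
represents an optimistic agent; while \(\succeq^4_{\mathcal{P}}\) represents an agent who considers
all possibilities. (I find \(\succeq^3_{\mathcal{P}}\) too conservative, and believe that \(\succeq^4_{\mathcal{P}}\) is a better
choice than \(\succeq^3_{\mathcal{P}}\).) Note that while \(\succeq^1_{\mathcal{P}}\) and \(\succeq^2_{\mathcal{P}}\) place a total order on acts,
the ordering \(\succeq^4_{\mathcal{P}}\) is only partial; some acts will be incomparable under \(\succeq^4_{\mathcal{P}}\).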
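
The four rules are easy to compare on the two-measure example above. In this sketch each act is represented simply by the list of its expected utilities, one entry per measure in the set; the function names are hypothetical labels for the rules, not notation from the text.

```python
def prefers_lower(eu_a, eu_b):
    """Compare worst cases: lower expected utility against lower."""
    return min(eu_a) >= min(eu_b)

def prefers_upper(eu_a, eu_b):
    """Compare best cases: upper expected utility against upper."""
    return max(eu_a) >= max(eu_b)

def prefers_lower_vs_upper(eu_a, eu_b):
    """The worst case of the first act must beat the best case of the second."""
    return min(eu_a) >= max(eu_b)

def prefers_pointwise(eu_a, eu_b):
    """The first act must do at least as well under every measure."""
    return all(x >= y for x, y in zip(eu_a, eu_b))

# Expected utilities under the two measures, as in the example above.
eu_a = [2, 4]
eu_a_prime = [1, 3]

print(prefers_lower(eu_a, eu_a_prime))           # True
print(prefers_upper(eu_a, eu_a_prime))           # True
print(prefers_lower_vs_upper(eu_a, eu_a_prime))  # False
print(prefers_lower_vs_upper(eu_a_prime, eu_a))  # False
print(prefers_pointwise(eu_a, eu_a_prime))       # True
```

The two `False` results show that the acts are incomparable under the most conservative rule, while the pointwise rule still ranks the first act above the second.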

8 Conclusion 

I have provided a brief overview of some of the issues that arise when
representing uncertainty by sets of probabilities, with a particular focus on
updating and decision making. Before concluding, I briefly mention two other
issues that may be of interest:

• There are propositional logics for reasoning about probability
and Dempster-Shafer belief functions [Fagin, Halpern, and Megiddo
1990]. More recently, logics have been provided for reasoning about
lower and upper probabilities [Halpern and Pucella 2002a] and lower
and upper expectations [Halpern and Pucella 2002b]. The syntax of
the logics for reasoning about probability, belief functions, and lower
and upper probability is the same in all cases. All include statements such as
\((2/3)\ell(\varphi) + (3/4)\ell(\psi) \ge 1/2\), where \(\varphi\) and \(\psi\) are propositional formulas.
The "\(\ell\)" here stands for "likelihood". Thus, this statement says that 2/3
times the likelihood of \(\varphi\) plus 3/4 times the likelihood of \(\psi\) is at least
1/2. "Likelihood" can be interpreted as either probability, belief, or
lower probability. In the last case, the upper probability of \(\varphi\) can be
expressed as \(1 - \ell(\neg\varphi)\). (In the case of belief, the same formula defines
the plausibility of \(\varphi\).)

The syntax for the logic of expectation is similar in spirit. It
includes formulas of the form \((2/3)e(\gamma) + (3/4)e(\gamma') \ge 1/2\), where \(\gamma\) and
\(\gamma'\) are propositional gambles. A propositional gamble has the form
\(a_1\varphi_1 + \cdots + a_k\varphi_k\), where \(a_1, \ldots, a_k\) are real numbers and \(\varphi_1, \ldots, \varphi_k\) are
propositional formulas. This propositional gamble is interpreted as the
gamble \(a_1 X_{[[\varphi_1]]} + \cdots + a_k X_{[[\varphi_k]]}\), where \([[\varphi]]\) is the set of worlds where \(\varphi\)
is true. Thus, a propositional gamble such as \(2\varphi + 3\psi\) is interpreted as
the gamble \(2X_{[[\varphi]]} + 3X_{[[\psi]]}\), which returns 5 in worlds where both \(\varphi\) and
\(\psi\) are true, 2 in worlds where \(\varphi \wedge \neg\psi\) is true, and so on. Again, different
interpretations of \(e\) are allowed; it can be interpreted as probabilistic
expectation, expected belief (see [Halpern 2003] for a definition of
expected belief), or lower expectation (in which case upper expectation
can be defined in the obvious way).

The axioms of the logics depend on the interpretation of \(\ell\) and \(e\). In all
cases, there is an elegant sound and complete axiomatization. In the
case of lower and upper probabilities (resp., lower and upper
expectations), not surprisingly, the key axioms are those corresponding to
the properties described in Theorem 3.1 (resp., Theorem 6.6).
Moreover, not only are the logics decidable, but the satisfiability problem is
NP-complete in all cases, the same as that of propositional logic (and
of the logic for reasoning about probability). Reasoning about lower
and upper probability (resp., expectation) is thus, in a precise sense,
no more difficult than propositional reasoning.

• Bayesian networks provide a compact way of representing probability
measures, taking advantage of independencies and conditional
independencies. There has been a great deal of work in the AI community
showing how Bayesian networks can be used for efficient probabilistic
reasoning (see [Pearl 1988] for an overview). We can define what it
means for \(U\) and \(V\) to be conditionally independent with respect to a
set \(\mathcal{P}\) of probability measures. Roughly speaking, \(U\) and \(V\) are
independent with respect to \(\mathcal{P}\) if \(\mu(V \mid U) = \mu(V)\) for all \(\mu \in \mathcal{P}\) (special
care must be taken to deal with the case that \(\mu(U) = 0\); see [Halpern
2001] for details). Conditional independence is defined in the same
way. Once we do this, then the whole technology of Bayesian networks
can be applied to sets of probabilities, essentially without change; see
[Halpern 2001] for details.
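
The interpretation of a propositional gamble described in the first item can be made concrete with a small sketch. It assumes two hypothetical primitive propositions p and q, takes worlds to be truth assignments to them, and represents a propositional gamble as a list of (coefficient, formula) pairs; none of these names come from the logics themselves.

```python
from itertools import product

def gamble_value(terms, world):
    """Value of a_1*phi_1 + ... + a_k*phi_k in a world: the sum of the
    coefficients of the formulas that are true in that world."""
    return sum(coeff for coeff, formula in terms if formula(world))

# Worlds are truth assignments to the hypothetical propositions p and q.
atoms = ("p", "q")
worlds = [dict(zip(atoms, vals)) for vals in product((True, False), repeat=2)]

def phi(w):
    """The formula p."""
    return w["p"]

def psi(w):
    """The formula q."""
    return w["q"]

# The propositional gamble 2*phi + 3*psi.
terms = [(2, phi), (3, psi)]

values = {(w["p"], w["q"]): gamble_value(terms, w) for w in worlds}
print(values)
# {(True, True): 5, (True, False): 2, (False, True): 3, (False, False): 0}
```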

As this discussion shows, using sets of probabilities provides a flexible 
way of representing uncertainty that enables an agent to represent ignorance 
as well as likelihood, while still retaining many of the pleasant features of 
using just a single probability measure to represent uncertainty. 

Acknowledgments: Thanks to Franz Huber for a careful reading of the
paper and useful comments.